Abstract: We present PoWerStore, the first efficient robust
storage protocol that achieves optimal latency without using 
digital signatures. 

PoWerStore's robustness comprises tolerating asynchrony, 
maximum number of Byzantine storage servers, any number 
of Byzantine readers and crash-faulty writers, and guaranteeing 
wait-freedom and linearizability of read/write operations. Further- 
more, PoWerStore's efficiency stems from combining lightweight 
authentication, erasure coding and metadata write-backs where 
readers write-back only metadata to achieve linearizability. 

At the heart of PoWerStore are Proofs of Writing (PoW): 
a novel storage technique based on lightweight cryptography. 
PoW enable reads and writes in the single-writer variant of 
PoWerStore to have latency of 2 rounds of communication 
between a client and storage servers in the worst-case (which 
we show optimal). 

We further present and implement a multi-writer PoWerStore 
variant featuring 3-round writes/reads where the third read 
round is invoked only under active attacks, and show that it 
outperforms existing robust storage protocols, including crash- 
tolerant ones. 

I. Introduction 

Byzantine fault-tolerant (BFT) distributed protocols have 
recently been attracting considerable research attention, due 
to their appealing promise of masking various system issues 
ranging from simple crashes, through software bugs and mis- 
configurations, all the way to intrusions and malware. How- 
ever, there are many issues that render the use of existing BFT 
protocols questionable in practice; these include, e.g., weak 
guarantees under failures (e.g., ||5|, p3] , ||45|) or high cost 
in performance, deployment and maintenance when compared 
to crash-tolerant protocols pO| . This can help us derive the 
following requirements for the design of future BFT protocols: 

• A BFT protocol should be robust, i.e., it should tolerate 
actual Byzantine faults and actual asynchrony (model- 
ing network outages) while maintaining correctness and 
providing sustainable progress even under worst-case 
conditions that still meet the protocol assumptions. This 
requirement has often been neglected in BFT protocols 
that focus primarily on common, failure-free operation 
modes (e.g., [22], [24], [29]).

• A robust protocol should be efficient (e.g., p7) , |48|). 
We believe that the efficiency of a robust BFT protocol 
is best compared to the efficiency of its crash-tolerant 
counterpart. Ideally, a robust protocol should not incur 
significant performance and resource cost penalty with 
respect to a crash-tolerant implementation, hence making 
the replacement of a crash-tolerant protocol a viable 
option. 



We maintain that achieving these goals may require
challenging existing approaches and revisiting the use of
fundamental abstractions (such as cryptographic primitives).

In this paper, we focus on read/write storage [31], where
a set of client processes (readers and writers) share data
by leveraging a set of storage server processes. Besides being
fundamental, the read/write storage abstraction is also practi- 
cally appealing given that it lies at the heart of the Key- Value 
Store (KVS) APIs — a de-facto standard of modern cloud 
storage offerings. 

In this context, we tackle the problem of developing a 
robust and efficient asynchronous distributed storage protocol. 
More specifically, storage robustness implies [6]: (i) wait-freedom
[25], i.e., read/write operations invoked by correct clients
always eventually return, and (ii) optimal resilience, i.e.,
ensuring correctness despite the largest possible number t of
Byzantine server failures; in the Byzantine model, this
mandates using 3t + 1 servers [40].

Our main contribution is PoWerStore, the first efficient 
robust storage protocol that achieves optimal latency, mea- 
sured in maximum (worst-case) number of communication 
rounds between a client and storage servers, without using 
digital signatures. Perhaps surprisingly, the efficiency of PoW- 
erStore does not come from sacrificing consistency; namely, 
PoWerStore ensures linearizability [27] (or atomicity [31]) of
read/write operations. In fact, the efficiency of PoWerStore
stems from combining lightweight authentication, erasure
coding and metadata write-backs where readers write back
only metadata, avoiding much more expensive data write-backs.

At the heart of PoWerStore is a novel data storage tech- 
nique we call Proofs of Writing (PoW). PoW are inspired by 
commitment schemes [23]; PoW incorporate a 2-round write
procedure, where the second round of write effectively serves 
to "prove" that the first round has actually been completed 
before it is exposed to a reader. More specifically, PoW rely on 
the construction of a secure token that is committed in the first 
write round, but can only be verified after it is revealed in the 
second round. Here, token security means that the adversary 
cannot predict nor forge the token before the start of the 
second round. We construct PoW using cryptographic hash 
functions and efficient message authentication codes (MACs); 
in addition, we also propose an instantiation of PoW using 
Shamir's secret sharing scheme [43].

As a result, PoW allow PoWerStore to achieve 2-round
read/write operations in the single-writer (SW) case. This,
in the case of reads, matches the theoretical minimum for any
robust atomic storage implementation that supports an arbitrary
number of readers, including crash-only implementations [15].
Notably, PoWerStore is the first robust BFT storage protocol to
achieve the 2-round read latency without using digital
signatures. On the other hand, we prove that the 2-round write
is inherent, by showing a 2-round write lower bound for any
robust BFT storage that features metadata write-backs using
fewer than 4t + 1 storage servers.

In addition, PoWerStore employs erasure coding at the 
client side to offload the storage servers and to reduce the 
amount of data transferred over the network. Besides being 
the first robust BFT storage protocol to feature metadata write- 
backs, PoWerStore is also the first robust BFT storage protocol 
to tolerate an unbounded number of Byzantine readers [34] and
an unbounded number of writer crash faults.

Finally, while our SW variant of PoWerStore demonstrates 
the utility of PoW, for practical applications we propose 
a multi-writer (MW) variant of PoWerStore (referred to as 
M-PoWerStore). M-PoWerStore features 3-round writes and 
reads, where the third read round is invoked only under active 
attacks. M-PoWerStore also resists attacks specific to the
multi-writer setting that exhaust the timestamp domain [7]. We
evaluate M-PoWerStore and demonstrate its superiority even 
with respect to existing crash-tolerant robust atomic storage 
implementations. Our results show that in typical settings, the 
peak throughput achieved by M-PoWerStore improves over 
existing crash-tolerant [6] and Byzantine-tolerant [39] atomic
storage implementations, by 50% and 100%, respectively. 

The remainder of the paper is organized as follows. In
Section II, we outline our system model. In Section III,
we introduce PoWerStore and analyze its correctness. In
Section IV, we present the multi-writer variant of PoWerStore,
M-PoWerStore. In Section V, we evaluate an implementation
of M-PoWerStore. In Section VI, we overview related work,
and we conclude the paper in Section VII.



II. System Model 

We consider a distributed system that consists of three 
disjoint sets of processes: a set servers of size S = 3t + 1,
where t is the failure threshold parameter, containing processes
{s_1, ..., s_S}; a collection writers w_1, w_2, ... and a collection
readers r_1, r_2, .... The collection clients is the union of writers
and readers. We assume the data-centric model fTT), | |T2| , p4| 
where every client may asynchronously communicate with any 
server by message passing using point-to-point reliable com- 
munication channels; however, servers cannot communicate 
among each other, nor send messages to clients other than in 
reply to clients' messages. 

We further assume that each server pre-shares one
symmetric group key with all writers in the set W; in the
following, we denote this key by k_i. In addition, we assume
the existence of a cryptographic (i.e., one-way and
collision-resistant) hash function H(·), and a secure message
authentication function MAC_k(·), where k is a λ-bit symmetric key.

We model processes as probabilistic I/O automata [49]
where a distributed algorithm is a collection of such pro- 
cesses. Processes that follow the algorithm are called correct. 
We assume that any number of readers exhibit Byzantine 



1 32 1 (or arbitrary p8) ) faults. Moreover, up to t servers 
may be Byzantine. Byzantine processes can exhibit arbitrary 
behavior; however, we assume that Byzantine processes are 
computationally bounded and cannot break cryptographic hash 
functions or forge message authentication codes. Finally, any 
number of writers may fail by crashing. 

We focus on a read/write storage abstraction [31] which
exports two operations: WRITE(v), which stores value v, and
READ(), which returns the stored value. While every client
may invoke the READ operation, we assume that WRITEs are
invoked only by writers. We say that an operation (invocation)
op is complete if the client receives the response, i.e., if the
client returns from the invocation. We further assume that each
correct client invokes at most one operation at a time (i.e., does
not invoke the next operation until it receives the response for
the current operation). We further assume that the initial value
of a storage is a special value ⊥, which is not a valid input
value for a write operation.

In any execution of an algorithm, we say that a complete
operation op_1 precedes operation op_2 (or op_2 follows op_1) if
the response of op_1 precedes the invocation of op_2 in that
execution.

We focus on the strongest storage consistency and
progress semantics, namely linearizability [27] (or atomicity
[31]) and wait-freedom [25]. Wait-freedom states that if a
correct client invokes an operation op, then op eventually 
completes. Linearizability gives an illusion that a complete 
operation op by a correct client is executed instantly at some 
point in time between its invocation and response, whereas the 
operations invoked by faulty clients appear either as complete 
or not invoked at all. 

Finally, we measure the time-complexity of an atomic 
storage implementation in terms of number of communication 
round-trips (or simply rounds) between a client and servers. 
Intuitively, a round consists of a client sending the message to 
(a subset of) servers and receiving replies. A formal definition 
can be found in, e.g., ITS), p8). 



III. PoWerStore 

In this section, we provide a detailed description of the 
PoWerStore protocol and we analyze its correctness. In addi- 
tion, we show that PoWerStore exhibits optimal (worst-case) 
latency in both READ and WRITE operations. 

A. Proofs of Writing 

At the heart of PoWerStore is a novel technique we call 
Proofs of Writing (PoW). Intuitively, PoW enable PoWerStore
to complete a WRITE in two rounds, where the second
round of WRITE effectively serves to "prove" that the data
has been written to a quorum of servers (at least S − t) before it
is exposed to a client. As such, PoW obviate the need for
writing-back data, enabling efficient metadata write-backs and 
support for Byzantine readers. 

PoW are inspired by commitment schemes [23]; PoW
consist of the construction of a secure token that can only be
verified after the second round is invoked. Here, token security
means that the adversary can neither predict nor forge the token
before the start of the second round. More specifically, our
PoW implementation relies on the use of one-way collision- 
resistant functions seeded with pseudo-random input. We 
construct PoW as follows. 

In the first round, the writer first generates a pseudo-random 
nonce and stores the hash of the nonce together with the data 
in a quorum of servers. The writer then discloses the nonce to 
the servers in the second round; this nonce provides sufficient 
proof that the first round has actually completed. In fact, 
during the first round of a READ operation, the client collects 
the received nonces into a set and sends (writes-back) this set 
to the servers in the second round. The server then verifies the
PoW by checking whether the hash of the received nonce matches
the stored hash. Note that since the writer keeps the nonce
secret until the start of the second round, it is computationally 
infeasible for the adversary to acquire the nonce unless the 
first round of WRITE has been completed. 
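The commit-then-reveal step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of SHA-256 for H, the 32-byte nonce length, and the function names `new_pow` and `verify_pow` are all assumptions made for the example.

```python
import hashlib
import secrets

NONCE_BYTES = 32  # assumed nonce length; the text only requires a pseudo-random nonce


def new_pow():
    """Writer side: draw a fresh nonce and the hash that is stored in round one."""
    nonce = secrets.token_bytes(NONCE_BYTES)
    commitment = hashlib.sha256(nonce).digest()  # SHA-256 stands in for H
    return nonce, commitment


def verify_pow(revealed_nonce, stored_commitment):
    """Server side: the revealed nonce must hash to the value stored in round one."""
    digest = hashlib.sha256(revealed_nonce).digest()
    return secrets.compare_digest(digest, stored_commitment)
```

Until the writer reveals the nonce in the second round, a party that only sees the stored hash would have to invert H to produce a value that passes `verify_pow`.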

PoW are not restricted to the use of cryptographic hash
functions. In Section III-F we propose another possible
instantiation of PoW using Shamir's secret sharing scheme [43].

B. Overview of PoWerStore 

In PoWerStore, the WRITE operation completes in two 
rounds, called STORE and COMPLETE. Likewise, the READ 
completes in two rounds, called COLLECT and FILTER. For the
sake of convenience, each round rnd ∈ {STORE, COMPLETE,
COLLECT, FILTER} is wrapped by a procedure rnd. In each
round rnd, the client sends a message of type rnd to all
servers. A round completes at the latest when the client
receives messages of type rnd_ACK from S − t correct servers.
The server maintains a variable lc to store the timestamp of
the last completed WRITE, and a variable LC, of the same
structure, to maintain a set of timestamps written back by
clients. In addition, the server keeps a variable Hist storing
the history of the data written by the writer in the STORE 
round, indexed by timestamp. 
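Since every round waits for S − t replies with S = 3t + 1, the quorum arithmetic that the correctness arguments rely on can be checked with a few lines. The helper names below are hypothetical and only restate that arithmetic.

```python
def quorum_size(t):
    """Replies a client waits for in each round: S - t with S = 3t + 1."""
    servers = 3 * t + 1
    return servers - t


def min_correct_in_quorum(t):
    """Correct servers guaranteed in any quorum when up to t servers are Byzantine."""
    return quorum_size(t) - t


def min_quorum_overlap(t):
    """Smallest possible intersection of two quorums of size S - t."""
    servers = 3 * t + 1
    return 2 * quorum_size(t) - servers


def min_correct_in_overlap(t):
    """Correct servers guaranteed in the intersection of two quorums."""
    return min_quorum_overlap(t) - t
```

For any t, a quorum has 2t + 1 servers, at least t + 1 of them correct, and two quorums always share at least one correct server.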

C. WRITE Implementation 

The WRITE implementation is given in Algorithm 1. To
write a value V, the writer¹ increases its timestamp ts,
computes a nonce N and its hash N̄ = H(N), and invokes STORE
with ts, V and N̄. When the STORE procedure returns, the
writer invokes COMPLETE with ts and N. After COMPLETE
returns, the WRITE completes.

In STORE, the writer encodes V into S fragments fr_i (1 ≤
i ≤ S), such that V can be recovered from any subset of t + 1
fragments. Furthermore, the writer computes a cross-checksum
cc consisting of the hashes of each fragment. For each server
s_i (1 ≤ i ≤ S), the writer sends a STORE⟨ts, fr_i, cc, N̄⟩
message to s_i. On reception of such a message, the server
writes (fr_i, cc, N̄) into the history entry Hist[ts] and replies
to the writer. After the writer receives S − t replies from
different servers, the STORE procedure returns, and the writer
proceeds to COMPLETE.
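The cross-checksum used in STORE can be sketched as below. The sketch assumes SHA-256 for H and uses 0-based server indices; the fragments are treated as opaque byte strings, so no actual erasure coding is performed here, and the function names are illustrative only.

```python
import hashlib


def H(data):
    """Stand-in for the cryptographic hash H (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def cross_checksum(fragments):
    """cc[i] is the hash of the fragment destined for server i."""
    return [H(fr) for fr in fragments]


def fragment_ok(i, fragment, cc):
    """Check the fragment reported by server i against entry i of cc."""
    return 0 <= i < len(cc) and H(fragment) == cc[i]
```

Because every server stores the whole vector cc next to its own fragment, a reader can later tell which of the returned fragments are intact before decoding.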

¹ Recall that PoWerStore is a single-writer storage protocol.



Definitions:
    ts : structure num, initially ts_0

operation WRITE(V)
    ts ← ts + 1
    N ← {0,1}^λ
    N̄ ← H(N)
    STORE(ts, V, N̄)
    COMPLETE(ts, N)
    return OK

procedure STORE(ts, V, N̄)
    {fr_1, ..., fr_S} ← encode(V, t + 1, S)
    cc ← [H(fr_1), ..., H(fr_S)]
    for 1 ≤ i ≤ S do send STORE⟨ts, fr_i, cc, N̄⟩ to s_i
    wait for STORE_ACK⟨ts⟩ from S − t servers

procedure COMPLETE(ts, N)
    send COMPLETE⟨ts, N⟩ to all servers
    wait for COMPLETE_ACK⟨ts⟩ from S − t servers

Algorithm 1: Algorithm of the writer in PoWerStore.



In COMPLETE, the writer sends a COMPLETE⟨ts, N⟩ message
to all servers. Upon reception of such a message, the
server changes the value of lc to (ts, N) if ts > lc.ts and
replies to the writer. After the writer receives S − t replies
from different servers, the COMPLETE procedure returns.

D. READ Implementation 

The READ implementation is given in Algorithm 3; it
consists of the COLLECT procedure followed by the FILTER
procedure. In COLLECT, the client reads the tuples (ts, N)
included in the set LC ∪ {lc} at the server, and accumulates
these tuples in a set C together with the tuples read from other
servers. We call such a tuple a candidate and C a candidate
set. Before responding to the client, the server garbage-collects
old tuples in LC using the GC procedure. After the client
receives candidates from S − t different servers, COLLECT
returns.

In FILTER, the client submits C to each server. Upon
reception of C, the server performs a write-back of the
candidates in C (metadata write-back). In addition, the server
picks c_hv as the candidate with the highest timestamp in C
such that valid(c_hv) holds. The predicate valid(c) holds if the
server, based on the history, is able to verify the integrity of
c by checking that H(c.N) equals Hist[c.ts].N̄. The server
then responds to the client with a message including the
timestamp c_hv.ts, the fragment Hist[c_hv.ts].fr and the
cross-checksum Hist[c_hv.ts].cc. The client waits for the responses
from servers until there is a candidate c with the highest
timestamp in C such that safe(c) holds, or until C is empty,
after which FILTER returns. The predicate safe(c) holds
if at least t + 1 different servers s_i have responded with
timestamp c.ts, fragment fr_i and cross-checksum cc such that
H(fr_i) = cc[i]. If C ≠ ∅, the client selects the candidate with
the highest timestamp c ∈ C and restores value V by decoding
V from the t + 1 correct fragments received for c. Otherwise,
the client sets V to the initial value ⊥. Finally, the READ
Definitions:
    lc : structure (ts, N), initially (ts_0, NULL)   // last completed write
    LC : set of structure (ts, N), initially ∅       // set of written-back candidates
    Hist[...] : vector of (fr, cc, N̄) indexed by ts, initially Hist[ts_0] = (NULL, NULL, NULL)

upon receiving STORE⟨ts, fr, cc, N̄⟩ from the writer
    Hist[ts] ← (fr, cc, N̄)
    send STORE_ACK⟨ts⟩ to the writer

upon receiving COMPLETE⟨ts, N⟩ from the writer
    if ts > lc.ts then lc ← (ts, N)
    send COMPLETE_ACK⟨ts⟩ to the writer

procedure GC()
    c_hv ← c_0
    c_hv ← c ∈ LC : c = max({c ∈ LC : valid(c)})
    if c_hv.ts > lc.ts then lc ← c_hv
    LC ← {c ∈ LC : c.ts > lc.ts ∧ Hist[c.ts] = NULL}

upon receiving COLLECT⟨tsr⟩ from client r
    GC()
    send COLLECT_ACK⟨tsr, LC ∪ {lc}⟩ to client r

upon receiving FILTER⟨tsr, C⟩ from client r
    LC ← LC ∪ C   // write-back
    c_hv ← c_0
    c_hv ← c ∈ C : c = max({c ∈ C : valid(c)})
    (fr, cc) ← (Hist[c_hv.ts].fr, Hist[c_hv.ts].cc)
    send FILTER_ACK⟨tsr, c_hv.ts, fr, cc⟩ to client r

Predicates:
    valid(c) ≜ (H(c.N) = Hist[c.ts].N̄)

Algorithm 2: Algorithm of server in PoWerStore.

returns V.
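The server-side handlers of Algorithm 2 can be approximated by the following in-memory sketch. It is an illustration under stated assumptions rather than the authors' code: messages are modeled as method calls and return values, SHA-256 stands in for H, the initial timestamp is taken to be 0, a missing dictionary key plays the role of a NULL history entry, and the class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Candidate = Tuple[int, Optional[bytes]]  # (ts, nonce)
TS0 = 0                                  # assumed initial timestamp
C0: Candidate = (TS0, None)              # assumed initial candidate


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


@dataclass
class Server:
    lc: Candidate = C0                                   # last completed write
    LC: Set[Candidate] = field(default_factory=set)      # written-back candidates
    hist: Dict[int, tuple] = field(default_factory=dict) # ts -> (fr, cc, nonce_hash)

    def valid(self, c: Candidate) -> bool:
        entry = self.hist.get(c[0])
        return entry is not None and c[1] is not None and H(c[1]) == entry[2]

    def _highest_valid(self, cands) -> Candidate:
        good = [c for c in cands if self.valid(c)]
        return max(good, key=lambda c: c[0]) if good else C0

    def on_store(self, ts, fr, cc, nonce_hash):
        self.hist[ts] = (fr, cc, nonce_hash)
        return ("STORE_ACK", ts)

    def on_complete(self, ts, nonce):
        if ts > self.lc[0]:
            self.lc = (ts, nonce)
        return ("COMPLETE_ACK", ts)

    def gc(self):
        chv = self._highest_valid(self.LC)
        if chv[0] > self.lc[0]:
            self.lc = chv
        # keep only newer candidates whose history entry has not arrived yet
        self.LC = {c for c in self.LC
                   if c[0] > self.lc[0] and c[0] not in self.hist}

    def on_collect(self, tsr):
        self.gc()
        return ("COLLECT_ACK", tsr, self.LC | {self.lc})

    def on_filter(self, tsr, C):
        self.LC |= set(C)  # metadata write-back
        chv = self._highest_valid(C)
        fr, cc, _ = self.hist.get(chv[0], (None, None, None))
        return ("FILTER_ACK", tsr, chv[0], fr, cc)
```

Note that a forged candidate is still written back into `LC` by `on_filter`, but it never validates and is discarded by the next `gc` call once a candidate with an equal or higher timestamp has completed.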

E. Correctness Sketch 

In what follows, we show that PoWerStore satisfies
linearizability and wait-freedom in all executions. Due to lack
of space, we refer the readers to the Appendix for a detailed
proof of PoWerStore's correctness.

We first explain why linearizability is satisfied by arguing
that if a READ follows a completed WRITE(V) (resp.
a completed READ that returns V) then the READ does not
return a value older than V.

a) READ/WRITE Linearizability: Suppose a READ
rd by a correct client follows a completed WRITE(V).
If ts is the timestamp of WRITE(V), we argue that rd
does not select a candidate with a timestamp lower than ts.
Since a correct server never changes lc to a candidate with a
lower timestamp, after WRITE(V) has completed, t + 1 correct
servers hold a valid candidate with timestamp ts or greater in
lc. Hence, during COLLECT in rd, a valid candidate c with
timestamp ts or greater is included in C. Since c is valid,
at least t + 1 correct servers hold history entries matching c
and none of them respond with a timestamp lower than c.ts.
Consequently, at most 2t timestamps received by the client
in FILTER are lower than c.ts and thus c is never excluded
from C. By Algorithm 3, rd does not select a candidate with
timestamp lower than c.ts ≥ ts.

b) READ Linearizability: Suppose a READ rd' by a
correct client follows rd. If c is the candidate selected
in rd, we argue that rd' does not select a candidate with a
timestamp lower than c.ts. By the time rd completes, t + 1
correct servers hold c in LC. According to Algorithm 2, if a
correct server excludes c from LC, then the server changed lc
to a valid candidate with timestamp c.ts or greater. As such,
t + 1 correct servers hold in LC ∪ {lc} a valid candidate with
timestamp c.ts or greater and during COLLECT in rd', a valid
candidate c' with timestamp c.ts or greater is included in C.
By applying the same arguments as above, rd' does not select
a candidate with timestamp lower than c'.ts ≥ c.ts.

We now show that PoWerStore satisfies wait-freedom; here, 
we argue that the READ does not block in FILTER while 
waiting for the candidate c with the highest timestamp in C 
to become safe(c) or C to become empty. 

c) Wait-freedom: Suppose by contradiction that rd
blocks during FILTER after receiving FILTER_ACK messages
from all correct servers. We distinguish two cases: (Case 1) C
contains a valid candidate and (Case 2) C contains no valid
candidate.

• Case 1: Let c be the valid candidate with the highest
timestamp in C. As c is valid, at least t + 1 correct
servers hold history entries matching c. Since no valid
candidate in C has a higher timestamp than c.ts, these
t + 1 correct servers responded with timestamp c.ts,
the corresponding erasure-coded fragment fr_i and
cross-checksum cc such that H(fr_i) = cc[i]. Therefore, c is
safe(c). Furthermore, all correct servers (at least S − t)
responded with timestamps at most c.ts. Hence, every
candidate c' ∈ C such that c'.ts > c.ts becomes
invalid(c') and is excluded from C. As such, c is also
highcand(c) and we conclude that rd does not block.

• Case 2: Since none of the candidates in C is valid, all
correct servers (at least S − t) responded with timestamp
ts_0, which is lower than any candidate timestamp. As
such, every candidate c ∈ C becomes invalid(c) and is
excluded from C. We therefore conclude that rd does
not block.
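The predicates used in the wait-freedom argument can be sketched in Python as follows. This is a simplified illustration with several assumptions: candidates are represented by their timestamps only (nonces are omitted), server indices are 0-based, SHA-256 stands in for H, and the helper `prune` is a hypothetical name for the step that removes invalid candidates from C.

```python
import hashlib
from typing import Dict, List, Optional, Set, Tuple

Reply = Tuple[int, Optional[bytes], Optional[List[bytes]]]  # (ts, fr, cc) of one server


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def highcand(c_ts: int, C: Set[int]) -> bool:
    """c has the highest timestamp among the remaining candidates."""
    return bool(C) and c_ts == max(C)


def invalid(c_ts: int, W: Dict[int, Reply], S: int, t: int) -> bool:
    """At least S - t servers answered with a timestamp lower than c.ts."""
    lower = sum(1 for (ts, _, _) in W.values() if ts < c_ts)
    return lower >= S - t


def safe(c_ts: int, W: Dict[int, Reply], t: int) -> bool:
    """At least t + 1 servers answered with c.ts, one common cc, and matching fragments."""
    groups = {}
    for i, (ts, fr, cc) in W.items():
        if (ts == c_ts and fr is not None and cc is not None
                and i < len(cc) and H(fr) == cc[i]):
            groups.setdefault(tuple(cc), set()).add(i)
    return any(len(members) >= t + 1 for members in groups.values())


def prune(C: Set[int], W: Dict[int, Reply], S: int, t: int) -> Set[int]:
    """Drop every candidate that has become invalid."""
    return {c for c in C if not invalid(c, W, S, t)}
```

With t = 1 and S = 4, a bogus candidate with timestamp 9 is pruned as soon as three servers have answered with lower timestamps, after which the genuine candidate becomes both highcand and safe.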

F. PoW based on Shamir's Secret Sharing Scheme 

In what follows, we propose an alternative construction of
PoW based on Shamir's secret sharing scheme [43]. Here,
the writer constructs a polynomial P(·) of degree t with
coefficients {a_t, ..., a_0} chosen at random from Z_q, where q
is a public parameter. That is, $P(x) = \sum_{j=0}^{t} a_j x^j$.
Definitions:
    tsr : num
    Q, R : set of pid, initially ∅
    C : set of (ts, N), initially ∅
    W[1 ... S] : vector of (ts, fr, cc)

operation READ()
    C, Q, R ← ∅
    tsr ← tsr + 1
    C ← COLLECT(tsr)
    C ← FILTER(tsr, C)
    if C ≠ ∅ then
        c ← c' ∈ C : highcand(c') ∧ safe(c')
        V ← RESTORE(c.ts)
    else V ← ⊥
    return V

procedure COLLECT(tsr)
    send COLLECT⟨tsr⟩ to all servers
    wait until |Q| ≥ S − t
    return C

upon receiving COLLECT_ACK⟨tsr, C_i⟩ from server s_i
    Q ← Q ∪ {i}
    C ← C ∪ {c ∈ C_i : c.ts > ts_0}

procedure FILTER(tsr, C)
    send FILTER⟨tsr, C⟩ to all servers
    wait until |R| ≥ S − t ∧ ((∃c ∈ C : highcand(c) ∧ safe(c)) ∨ C = ∅)
    return C

upon receiving FILTER_ACK⟨tsr, ts, fr, cc⟩ from server s_i
    R ← R ∪ {i}; W[i] ← (ts, fr, cc)
    C ← C \ {c ∈ C : invalid(c)}

procedure RESTORE(ts)
    cc ← cc' s.t. ∃R' ⊆ R : |R'| ≥ t + 1 ∧ (∀i ∈ R' : W[i].ts = ts ∧ W[i].cc = cc')
    F ← {W[i].fr : i ∈ R ∧ W[i].ts = ts ∧ H(W[i].fr) = cc[i]}
    V ← decode(F, t + 1, S)
    return V

Predicates:
    highcand(c) ≜ (c.ts = max({c'.ts : c' ∈ C}))
    safe(c) ≜ ∃R' ⊆ R : |R'| ≥ t + 1 ∧
              (∀i ∈ R' : W[i].ts = c.ts) ∧
              (∀i, j ∈ R' : W[i].cc = W[j].cc ∧ H(W[i].fr) = W[j].cc[i])
    invalid(c) ≜ |{i ∈ R : W[i].ts < c.ts}| ≥ S − t

Algorithm 3: Algorithm of client r in PoWerStore.



The writer then constructs the PoW as follows: for each
server s_i, the writer picks a random point x_i, and constructs
the share (x_i, P_i), where P_i = P(x_i). As such, the writer
constructs S different shares, one for each server, and sends
a STORE⟨ts, fr_i, cc, x_i, P_i⟩ message to each server s_i over a
confidential channel. Upon reception of such a message, server
s_i stores (fr_i, cc, x_i, P_i) in Hist[ts]. Note that since there are
at most t Byzantine servers, these servers cannot reconstruct
the polynomial P(·) from their shares, even if they collude. In
the COMPLETE round, the writer reveals the polynomial P(·)
in a COMPLETE⟨ts, P(·)⟩ message. This enables a client to
determine for a candidate c = (ts, P(·)) that the corresponding
STORE round completed by checking that t + 1 servers s_i
stored (x_i, P(x_i)), without the servers revealing their share.
For this purpose, the valid predicate at server s_i changes to
valid(c) ≜ (c.P(Hist[c.ts].x_i) = Hist[c.ts].P_i).

By relying on randomly chosen x_i, and the fact that correct
servers never divulge their share, our construction prevents an
adversary from fabricating a partially corrupted polynomial
after the disclosure of P(·). To see why this matters, note that
if the adversary knew both P(·) and the point x_i held by a
correct server s_i, it could fabricate a partially corrupted
polynomial P'(·) that intersects with P(·) only at x_i (i.e.,
P'(x_i) = P(x_i)). This would lead to a situation in which a
candidate c is neither safe(c) nor invalid(c) and thus, the READ
would block.

We point out that, unlike the solution based on hash functions,
this construction provides information-theoretic guarantees for
the PoW [4], [33] during the STORE round.
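The secret-sharing variant can be illustrated with the short sketch below. It is only a sketch under assumptions: the modulus (the Mersenne prime 2^127 − 1) is an arbitrary choice for the public parameter q, the polynomial is represented by its coefficient list, and all function names are hypothetical.

```python
import secrets

Q = 2**127 - 1  # assumed public prime modulus q


def random_polynomial(t):
    """Coefficients [a_0, ..., a_t] of a polynomial of degree at most t over Z_q."""
    return [secrets.randbelow(Q) for _ in range(t + 1)]


def evaluate(coeffs, x):
    """Horner evaluation of P(x) mod q."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % Q
    return acc


def make_shares(coeffs, num_servers):
    """One share (x_i, P(x_i)) per server, each at a random point x_i."""
    points = [secrets.randbelow(Q) for _ in range(num_servers)]
    return [(x, evaluate(coeffs, x)) for x in points]


def valid(revealed_coeffs, share):
    """Server-side check after COMPLETE: the revealed polynomial passes through the share."""
    x_i, p_i = share
    return evaluate(revealed_coeffs, x_i) == p_i
```

A server keeps its share private and only answers whether the revealed polynomial matches it, which is what the modified valid predicate expresses.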

G. Optimality of PoWerStore 

In this section, we prove that PoWerStore features optimal 
WRITE latency, by showing that writing in two rounds is 
necessary. We start by giving some informal definitions. 

A distributed algorithm ^ is a collection of automata p6) , 
where automaton Ap is assigned to process p. Computation 
proceeds in steps of A and a run is an infinite sequence of 
steps of A. A partial run is a finite prefix of some run. We 
say that a (partial) run r extends some partial run pr if pr is 
a prefix of r. 

We say that an implementation of linearizable storage is 
selfish, if readers write-back only metadata instead of the full 
value. Intuitively the readers are selfish because they do not 
help the writer complete a write. For a formal definition, we 
refer the readers to [16]. Furthermore, we say that a WRITE
operation is fast if it completes in a single round. We now 
proceed to proving the main result. 

Theorem III.1. There is no fast WRITE implementation I of
a multi-reader selfish robust linearizable storage that makes
use of fewer than 4t + 1 servers.

Preliminaries. We prove Theorem III.1 by contradiction,
assuming at most 4t servers. We partition the set of servers into
four distinct subsets (which we call blocks), denoted by T_1, T_2, T_3,
each of size exactly t, and T_4 of size at least 1 and at most t.
Without loss of generality we assume that each block contains
at least one server. We say that an operation op skips a block
T_i (1 ≤ i ≤ 4) when all messages by op to T_i are delayed
indefinitely (due to asynchrony) and all other blocks T_j receive
all messages by op and reply.

Proof: We construct a series of runs of a linearizable
implementation I towards a partial run that violates linearizability.

• Let run_1 be the partial run in which all servers are correct
except T_1, which crashed at the beginning of run_1. Let
wr be the operation invoked by the writer w to write a
value v ≠ ⊥ in the storage. The WRITE wr is the only
operation invoked in run_1 and w crashes after writing v
to T_3. Hence, wr skips blocks T_1, T_2 and T_4.

• Let run'_1 be the partial run in which all servers are correct
except T_4, which crashed at the beginning of run'_1. In
run'_1, w is correct and wr completes by writing v to all
blocks except T_4, which it skips.

• Let run_2 be the partial run similar to run'_1, in which all
servers except T_2 are correct, but due to asynchrony, all
messages from w to T_4 are delayed. Like in run'_1, wr
completes by writing v to all servers except T_4, which it
skips. To see why, note that wr cannot distinguish run_2
from run'_1. After wr completes, T_2 fails Byzantine by
reverting its memory to the initial state.

• Let run_3 extend run_1 by appending a complete READ
rd_1 invoked by r_1. By our assumption, I is wait-free. As
such, rd_1 completes by skipping T_1 (because T_1 crashed)
and returns (after a finite number of rounds) a value v_R.

• Let run_4 extend run_2 by appending rd_1. In run_4, all
servers except T_2 are correct, but due to asynchrony
all messages from r_1 to T_1 are delayed indefinitely.
Moreover, since T_2 reverted its memory to the initial
state, v is held only by T_3. Note that r_1 cannot distinguish
run_4 from run_3 in which T_1 has crashed. As such, rd_1
completes by skipping T_1 and returns v_R. By
linearizability, v_R = v.

• Let run_5 be similar to run_3, in which all servers except
T_3 are correct but, due to asynchrony, all messages from
r_1 to T_1 are delayed. Note that r_1 cannot distinguish run_5
from run_3. As such, rd_1 returns v_R in run_5, and by
run_4, v_R = v. After rd_1 completes, T_3 fails by crashing.

• Let run_6 extend run_5 by appending a READ rd_2
invoked by r_2 that completes by returning v'. Note that
in run_5, (i) T_3 is the only block to which v was written,
(ii) rd_1 did not write back v (to any other server) before
returning v, and (iii) T_3 crashed before rd_2 is invoked.
As such, rd_2 does not find v in any server and hence
v' ≠ v, violating linearizability. □

Fig. 1. Sketch of the runs used in the proof of Theorem III.1.



Notice that Theorem III.1 applies even to implementations
relying on self-verifying data and/or not tolerating Byzantine 
readers. Furthermore, the proof extends to crash-tolerant stor- 
age when deleting the Byzantine block T2 in the above proof; 
the result being that there is no selfish implementation of a 
multi-reader crash-tolerant linearizable storage with less than 
3t + 1 servers in which every WRITE is fast. 

IV. M-PoWerStore 



The PoWerStore protocol, as presented in Section III has 
a drawback in having potentially very large candidate sets 
that servers send to clients and that clients write-back to 
servers. As a result, a malicious adversary can exploit the 
fact that in PoWerStore candidate sets can (theoretically) grow 
without bounds and mount a denial of service (DoS) attack by 
fabricating huge sets of bogus candidates. While this attack 
can be contained by a robust implementation of the point- 
to-point channel assumption using, e.g., a separate pair of 
network cards for each channel (in the vein of [13]), this may
impact practicality of PoWerStore. To rectify this issue, and 
for practical applications, we propose a multi-writer variant of 
our protocol called M-PoWerStore. 

A. Overview 

M-PoWerStore (Algorithms 4, 5 and 6) supports an
unbounded number of clients. In addition, M-PoWerStore
features optimal READ latency of two rounds in the common
case, where no process is Byzantine. Outside the common
case, under active attacks, M-PoWerStore gracefully degrades
to guarantee reading in at most three rounds. The WRITE has
a latency of three rounds, featuring non-skipping timestamps
[7], which prevent attacks specific to the multi-writer setting that
exhaust the timestamp domain.

The main difference between M-PoWerStore and PoWerStore 
is that, here, servers keep a single written-back candidate 
instead of a set. To this end, it is crucial that servers are 
able to determine the validity of a written-back candidate 
without consulting the history. For this purpose, we enhance 
our original PoW scheme by extending the candidate with 
message authentication codes (MACs) that authenticate the 
timestamp and the nonce's hash, one for each server, using 
the corresponding group key. As such, a valid MAC proves 
to a server that the timestamp has been issued by a writer 
in COMPLETE, and thus constitutes a PoW that a server can 
obtain even without the corresponding history entry. Note that 
if a candidate incorporates corrupted MACs, servers might 
disagree about the validity of a written-back candidate. 
Hence, a correct client might not be able to write back a 
candidate to t + 1 correct servers as needed. To solve this 
issue, M-PoWerStore "pre-writes" the MACs in the STORE 
round, enabling the client to repair the broken MACs of a 
selected candidate. We point out that the adversary cannot 
forge the MACs (and therefore bypass the PoW) before the 
start of COMPLETE, since the constructed MACs cover, besides 
the timestamp, also the nonce's hash. 

To support multiple writers, WRITE in M-PoWerStore 
comprises an additional clock synchronization round, called 
CLOCK, which is prepended to STORE. The READ performs 
an additional round called REPAIR, which is appended to 
COLLECT. The purpose of REPAIR is to repair and write-back 
candidates if necessary, and is invoked only under active attack 
by a malicious adversary that actually corrupts candidates. 

Similarly to PoWerStore, the server maintains the variable 
Hist to store the history of the data written by the writer 
in the STORE round, indexed by timestamp. In addition, the 
server keeps the variable lc to store the timestamp of the last 
completed write. 

B. WRITE Implementation 

The full WRITE implementation is given in Algorithm 4. 
In the following, we simply highlight the differences to 
PoWerStore. 

As outlined before, M-PoWerStore is resilient to the adversary 
skipping timestamps. This is achieved by having the writer 
authenticate the timestamp of a WRITE with a key k_w 
shared among the writers. Note that such a shared key can be 
obtained by combining the different group keys; for instance, 

$k_w \leftarrow H(k_1 \| k_2 \| \ldots)$. 

To obtain a timestamp, in the CLOCK procedure, the writer 
retrieves the timestamp (held in variable lc) from a quorum of 
S − t servers and picks the highest timestamp ts with a valid 
MAC. Then, the writer increases ts and computes a MAC for 
ts using k_w. Finally, CLOCK returns ts. 
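
The key derivation and the timestamp selection in CLOCK can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes SHA-1 for H and HMAC-SHA1 for the MAC (the primitives named in Section V), and the function names, the byte encoding of num ‖ pid, and the comparison of timestamps by their numeric part only are hypothetical simplifications.

```python
import hashlib
import hmac

def derive_writer_key(group_keys):
    # k_w = H(k_1 || k_2 || ...); SHA-1 stands in for H (assumption).
    return hashlib.sha1(b"".join(group_keys)).digest()

def mac_ts(k_w, num, pid):
    # MAC over num || pid; this fixed-width encoding is illustrative only.
    msg = num.to_bytes(8, "big") + pid.to_bytes(8, "big")
    return hmac.new(k_w, msg, hashlib.sha1).digest()

def next_timestamp(k_w, replies, writer_pid):
    # replies: (num, pid, mac) tuples collected from S - t servers.
    # Only timestamps carrying a valid MAC under k_w are considered,
    # so a Byzantine server cannot make the writer skip ahead.
    best = 0
    for num, pid, mac in replies:
        if num > best and hmac.compare_digest(mac, mac_ts(k_w, num, pid)):
            best = num
    num = best + 1
    return (num, writer_pid, mac_ts(k_w, num, writer_pid))
```

A forged reply with a very large timestamp but no valid MAC is simply ignored, which is the non-skipping property described above.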

To write a value V, the writer (i) obtains a timestamp 
ts from the CLOCK procedure, (ii) authenticates ts and the 
nonce's hash N̄ by a vector of MACs vec, with one entry 
for each server s_i using group key k_i, and (iii) stores 
vec both in STORE and COMPLETE. Upon reception of a 
STORE⟨ts, fr_i, cc, N̄, vec⟩ message, the server writes the tuple 
(fr_i, cc, N̄, vec) into the history entry Hist[ts]. Upon reception 
of a COMPLETE⟨ts, N, vec⟩ message, the server changes 
the value of lc to (ts, N, vec) if ts > lc.ts. 
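
The following sketch illustrates how a writer could build the MAC vector and how a server could later accept a candidate either through its history or through its own MAC. It is a hedged illustration under stated assumptions: SHA-1 and HMAC-SHA1 stand in for H and MAC, timestamps are opaque byte strings, and `make_candidate` and `valid` are hypothetical names that mirror, but do not reproduce, Algorithms 4 and 5.

```python
import hashlib
import hmac
import os

def H(data):
    return hashlib.sha1(data).digest()

def mac(key, ts, n_hash):
    # MAC over ts || H(N) under one server's group key.
    return hmac.new(key, ts + n_hash, hashlib.sha1).digest()

def make_candidate(ts, group_keys):
    nonce = os.urandom(20)   # N, revealed only in COMPLETE
    n_hash = H(nonce)        # hashed nonce, sent in STORE
    vec = [mac(k, ts, n_hash) for k in group_keys]
    return nonce, n_hash, vec

def valid(i, key_i, hist, ts, nonce, vec):
    # Server s_i accepts candidate (ts, N, vec) if its history holds the
    # matching hashed nonce, or if its own entry vec[i] verifies.
    by_hist = ts in hist and hist[ts] == H(nonce)
    by_mac = hmac.compare_digest(vec[i], mac(key_i, ts, H(nonce)))
    return by_hist or by_mac
```

The MAC path is what lets a server that missed the STORE round still recognize a written-back candidate.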

C. READ Implementation 

The full READ implementation is given in Algorithm 6. 
The READ consists of three consecutive rounds, COLLECT, 
FILTER and REPAIR. In COLLECT, a client reads the candidate 
triple (ts, N, vec) stored in variable lc in the server, and inserts 
it into the candidate set C together with the candidates read 
from other servers. After the client receives S − t candidates 
from different servers, COLLECT returns. 



82: Definitions: 
83:   Q: set of pid (process id), initially ∅ 
84:   ts: structure (num, pid, MAC_{k_w}(num ‖ pid)), 
      initially ts_0 = (0, 0, NULL) 

85: operation WRITE(V) 
86:   Q ← ∅ 
87:   ts ← CLOCK() 
88:   N ← {0,1}^λ 
89:   N̄ ← H(N) 
90:   vec ← [MAC_{k_i}(ts ‖ N̄)]_{1≤i≤S} 
91:   STORE(ts, V, N̄, vec) 
92:   COMPLETE(ts, N, vec) 
93:   return OK 

94: procedure CLOCK() 
95:   send CLOCK⟨ts⟩ to all servers 
96:   wait until |Q| ≥ S − t 
97:   ts.num ← ts.num + 1 
98:   ts ← (ts.num, w, MAC_{k_w}(ts.num ‖ w)) 
99:   return ts 

100: upon receiving CLOCK_ACK⟨ts, ts_i⟩ from server s_i 
101:   Q ← Q ∪ {i} 
102:   if ts_i > ts ∧ verify(ts_i, k_w) then ts ← ts_i 

103: procedure STORE(ts, V, N̄, vec) 
104:   {fr_1, ..., fr_S} ← encode(V, t + 1, S) 
105:   cc ← [H(fr_1), ..., H(fr_S)] 
106:   foreach server s_i send STORE⟨ts, fr_i, cc, N̄, vec⟩ to s_i 
107:   wait for STORE_ACK⟨ts⟩ from S − t servers 

108: procedure COMPLETE(ts, N, vec) 
109:   send COMPLETE⟨ts, N, vec⟩ to all servers 
110:   wait for COMPLETE_ACK⟨ts⟩ from S − t servers 

Algorithm 4: Algorithm of writer w in M-PoWerStore. 



In FILTER, the client submits C to each server. Upon reception 
of C, the server chooses a candidate c_wb to write back, 
as the candidate with the highest timestamp in C such that 
valid(c_wb) holds, and sets lc to c_wb if c_wb.ts > lc.ts. Roughly 
speaking, the predicate valid(c) holds if the server verifies 
the integrity of the timestamp c.ts and nonce c.N either by 
the MAC or by the corresponding history entry. Besides that, 
the server chooses a candidate c_rt to return, as the candidate 
with the highest timestamp in C such that validByHist(c_rt) 
holds. The predicate validByHist(c) holds if the server keeps 
a matching history entry for c. The server then responds to 
the client with a message including the timestamp c_rt.ts, the 
fragment Hist[c_rt.ts].fr, the cross-checksum Hist[c_rt.ts].cc 
and the vector of MACs Hist[c_rt.ts].vec. 

The client waits for the responses from servers until there is 
a candidate c with the highest timestamp in C such that safe(c) 
holds, or until C is empty, after which FILTER returns. The 
predicate safe(c) holds if at least t + 1 different servers s_i have 
responded with timestamp c.ts, fragment fr_i, cross-checksum 
cc such that H(fr_i) = cc[i], and vector vec. If C is empty, 
the client sets V to the initial value ⊥. Otherwise, the client 
selects the highest candidate c ∈ C and restores value V by 
decoding V from the t + 1 correct fragments received for c. 
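
The client-side checks described above can be sketched as two small predicates. This is an illustrative sketch, assuming SHA-1 for H and a dictionary `W` that maps a server index to its FILTER reply (with hypothetical keys `ts`, `fr`, `cc`, `vec`); it follows the safe and invalid predicates of Algorithm 6 in spirit but is not the authors' code.

```python
import hashlib

def H(data):
    return hashlib.sha1(data).digest()

def invalid(c_ts, W, S, t):
    # c is dropped once at least S - t servers answered with a lower timestamp.
    return sum(1 for r in W.values() if r["ts"] < c_ts) >= S - t

def safe(c_ts, W, t):
    # Count replies for c.ts that agree on (cc, vec) and whose fragment
    # hashes to the server's own entry cc[i] of the cross-checksum.
    groups = {}
    for i, r in W.items():
        if r["ts"] == c_ts and H(r["fr"]) == r["cc"][i]:
            key = (tuple(r["cc"]), tuple(r["vec"]))
            groups[key] = groups.get(key, 0) + 1
    return any(n >= t + 1 for n in groups.values())
```

Because at least one of any t + 1 agreeing servers is correct, an agreed cross-checksum was really written, and each accepted fragment is bound to it by the hash check.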

In REPAIR, the client verifies the integrity of c.vec by 
matching it against the vector vec received from t + 1 different 
servers. If c.vec and vec match, then REPAIR returns. 
Otherwise, the client repairs c by setting c.vec to vec and 
invokes a round of write-back by sending a REPAIR⟨tsr, c⟩ 
message to all servers. Upon reception of such a message, 
the server sets lc to c if c.ts > lc.ts and valid(c) holds, and 
responds with a REPAIR_ACK message to the client. Once the 
client receives acknowledgements from S − t different servers, 
REPAIR returns. After REPAIR returns, the READ returns V. 

Definitions: 
  lc: structure (ts, N, vec), initially c_0 = (ts_0, NULL, NULL) 
  Hist[...]: vector of (fr, cc, N̄, vec) indexed by ts, 
      initially Hist[ts_0] = (NULL, NULL, NULL, NULL) 

111: upon receiving CLOCK⟨ts⟩ from writer w 
112:   send CLOCK_ACK⟨ts, lc.ts⟩ to writer w 

113: upon receiving STORE⟨ts, fr, cc, N̄, vec⟩ from writer w 
114:   Hist[ts] ← (fr, cc, N̄, vec) 
115:   send STORE_ACK⟨ts⟩ to writer w 

116: upon receiving COMPLETE⟨ts, N, vec⟩ from writer w 
117:   if ts > lc.ts then lc ← (ts, N, vec) 
118:   send COMPLETE_ACK⟨ts⟩ to writer w 

119: upon receiving COLLECT⟨tsr⟩ from client r 
120:   send COLLECT_ACK⟨tsr, lc⟩ to client r 

121: upon receiving FILTER⟨tsr, C⟩ from client r 
122:   c_wb, c_rt ← c_0 
123:   c_wb ← c ∈ C : c = max({c ∈ C : valid(c)}) 
124:   if c_wb.ts > lc.ts then lc ← c_wb   //write-back 
125:   c_rt ← c ∈ C : c = max({c ∈ C : validByHist(c)}) 
126:   (fr, cc, vec) ← (Hist[c_rt.ts].fr, Hist[c_rt.ts].cc, Hist[c_rt.ts].vec) 
127:   send FILTER_ACK⟨tsr, c_rt.ts, fr, cc, vec⟩ to client r 

128: upon receiving REPAIR⟨tsr, c⟩ from client r 
129:   if c.ts > lc.ts ∧ valid(c) then lc ← c 
130:   send REPAIR_ACK⟨tsr⟩ to client r 

131: Predicates: 
132:   valid(c) ≜ (validByHist(c) ∨ verify(c.vec[i], c.ts, H(c.N), k_i)) 
133:   validByHist(c) ≜ (H(c.N) = Hist[c.ts].N̄) 

Algorithm 5: Algorithm of server s_i in M-PoWerStore. 
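
The vector check at the start of REPAIR can be sketched as below. This is a minimal illustration with hypothetical names: `W` maps a server index to its FILTER reply, and `repaired_vec` returns the MAC vector confirmed by t + 1 servers together with a flag telling whether a write-back round is needed.

```python
from collections import Counter

def repaired_vec(c_ts, c_vec, W, t):
    # Tally the MAC vectors reported for timestamp c.ts and pick the one
    # that at least t + 1 servers (hence at least one correct one) returned.
    votes = Counter(tuple(r["vec"]) for r in W.values() if r["ts"] == c_ts)
    for vec, n in votes.items():
        if n >= t + 1:
            return list(vec), list(vec) != list(c_vec)
    raise ValueError("no vector confirmed by t + 1 servers")
```

In the common case the flag is false and the READ finishes after two rounds; the third round is only paid when the candidate's vector was actually corrupted.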

D. Correctness Sketch 

We show that M-PoWerStore satisfies READ linearizability 
as follows. We show that if a completed READ rd returns 
V then a subsequent READ rd' does not return a value 
older than V. Note that the arguments for READ/WRITE 
linearizability and wait-freedom are very similar to those of 
PoWerStore (Section III-E), and are therefore omitted. 

Suppose a READ rd' by a correct client follows after rd 
that returns V. If c is the candidate selected in rd, we argue 
that rd' does not select a candidate with a timestamp lower 
than c.ts. Note that besides c being a valid candidate, in 
REPAIR, the client also checks the integrity of c.vec. If c.vec 
passes the integrity check in rd (line 171 of Algorithm 6), 
then the integrity of c has been fully established. Otherwise, 
c.vec fails the integrity check. In that case the client repairs c 
(line 172 of Algorithm 6) and subsequently writes back c to 
t + 1 correct servers. In both cases, t + 1 correct servers have 
set lc to c or to a valid candidate with a higher timestamp. As 
such, during COLLECT in rd', a valid candidate c' such that 
c'.ts ≥ c.ts is included in C. Since c' is valid, t + 1 correct 
servers hold history entries matching c' and none of them 
responds with a timestamp lower than c'.ts. Consequently, 
at most 2t timestamps received by the client in FILTER are 
lower than c'.ts and thus c' is never excluded from C. By 
Algorithm 6, rd' does not select a candidate with timestamp 
lower than c'.ts ≥ c.ts. 
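
The counting step in this argument can be spelled out, assuming the optimally resilient setting S = 3t + 1 used throughout the paper. At most S − (t + 1) servers can report a timestamp lower than c'.ts, while the invalid predicate (line 178 of Algorithm 6) requires at least S − t such replies:

```latex
\[
  \underbrace{S-(t+1)}_{\text{replies that may be lower than } c'.ts}
  \;=\; (3t+1)-(t+1) \;=\; 2t
  \;<\; 2t+1 \;=\; S-t .
\]
```

Hence the exclusion threshold is never reached for c'.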



V. Implementation & Evaluation 

In this section, we describe an implementation modeling 
a Key-Value Store (KVS) based on M-PoWerStore. More 
specifically, to model a KVS, we use multiple instances 
of M-PoWerStore referenced by keys. We then evaluate the 
performance of our implementation and compare it with both 
(i) M-ABD, the multi-writer variant of the crash-only atomic 
storage protocol of [6], and (ii) Phalanx, the multi-writer 
robust atomic protocol of [39]. 

A. Implementation Setup 

Our KVS implementation is based on the Java-based 
framework Neko [2] that provides support for inter-process 
communication, and on the Jerasure [1] library for constructing 
the dispersal codes. To evaluate the performance of our 
M-PoWerStore we additionally implemented two KVSs based 
on M-ABD and Phalanx. 

In our implementation, we relied on 160-bit SHA-1 for 
hashing purposes, 160-bit keyed HMACs to implement MACs, 
and 1024-bit DSA to generate signatures. For simplicity, 
we abstract away the effect of message authentication in 
our implementations; we argue that this does not affect our 
performance evaluation since data origin authentication is 
typically handled as part of the access control layer in all three 
implementations, when deployed in realistic settings (e.g., in 
Wide Area Networks). 

We deployed our implementations on a private network 
consisting of a 4-core Intel Xeon E5443 with 4 GB of RAM, 
a 4-core Intel i5-3470 with 8 GB of RAM, an 8-core Intel 
i7-3770S with 8 GB of RAM, and a 64-core Intel Xeon 
E5606 equipped with 32 GB of RAM. In our network, the 
communication between various machines was bridged using 
a 1 Gbps switch. All the servers were running in separate 
processes on the Xeon E5606 machine, whereas the clients 
were distributed among the Xeon E5443, i7, and i5 
machines. Each client invokes operations in a closed loop, 
i.e., a client may have at most one pending operation. In all 
KVS implementations, all WRITE and READ operations are 
served by a local database stored on disk. 

TABLE I 
Default parameters used in evaluation. 

Parameter                    Default Value 
Failure threshold t          1 
File size                    256 KB 
Probability of Concurrency   1% 
Workload Distribution        100% READ / 100% WRITE 

134: Definitions: 
135:   tsr: num, initially 0 
136:   Q, R: set of pid, initially ∅ 
137:   C: set of (ts, N, vec), initially ∅ 
138:   W[1...S]: vector of (ts, fr, cc, vec), initially [] 

139: operation READ() 
140:   C, Q, R ← ∅ 
141:   tsr ← tsr + 1 
142:   C ← COLLECT(tsr) 
143:   C ← FILTER(tsr, C) 
144:   if C ≠ ∅ then 
145:     c ← c' ∈ C : highcand(c') ∧ safe(c') 
146:     V ← RESTORE(c.ts) 
147:     REPAIR(c) 
148:   else V ← ⊥ 
149:   return V 

150: procedure COLLECT(tsr) 
151:   send COLLECT⟨tsr⟩ to all servers 
152:   wait until |Q| ≥ S − t 
153:   return C 

154: upon receiving COLLECT_ACK⟨tsr, c_i⟩ from server s_i 
155:   Q ← Q ∪ {i} 
156:   if c_i.ts > ts_0 then C ← C ∪ {c_i} 

157: procedure FILTER(tsr, C) 
158:   send FILTER⟨tsr, C⟩ to all servers 
159:   wait until |R| ≥ S − t ∧ 
         ((∃c ∈ C : highcand(c) ∧ safe(c)) ∨ C = ∅) 
160:   return C 

161: upon receiving FILTER_ACK⟨tsr, ts, fr, cc, vec⟩ from server s_i 
162:   R ← R ∪ {i}; W[i] ← (ts, fr, cc, vec) 
163:   C ← C \ {c ∈ C : invalid(c)} 

164: procedure RESTORE(ts) 
165:   cc ← cc' s.t. ∃R' ⊆ R : |R'| ≥ t + 1 ∧ 
         (∀i ∈ R' : W[i].ts = ts ∧ W[i].cc = cc') 
166:   F ← {W[i].fr : i ∈ R ∧ W[i].ts = ts ∧ H(W[i].fr) = cc[i]} 
167:   V ← decode(F, t + 1, S) 
168:   return V 

169: procedure REPAIR(c) 
170:   vec ← vec' s.t. ∃R' ⊆ R : |R'| ≥ t + 1 ∧ 
         (∀i ∈ R' : W[i].ts = c.ts ∧ W[i].vec = vec') 
171:   if c.vec ≠ vec then 
172:     c.vec ← vec   //repair 
173:     send REPAIR⟨tsr, c⟩ to all servers 
174:     wait for REPAIR_ACK⟨tsr⟩ from S − t servers 

175: Predicates: 
176:   highcand(c) ≜ (c.ts = max({c'.ts : c' ∈ C})) 
177:   safe(c) ≜ ∃R' ⊆ R : |R'| ≥ t + 1 ∧ 
         (∀i ∈ R' : W[i].ts = c.ts) ∧ 
         (∀i, j ∈ R' : W[i].cc = W[j].cc ∧ H(W[i].fr) = W[j].cc[i]) ∧ 
         (∀i, j ∈ R' : W[i].vec = W[j].vec) 
178:   invalid(c) ≜ |{i ∈ R : W[i].ts < c.ts}| ≥ S − t 

Algorithm 6: Algorithm of client r in M-PoWerStore. 

We evaluate the peak throughput incurred in M-PoWerStore 
in WRITE and READ operations, when compared to M-ABD 
and Phalanx with respect to: (i) the file size (64 KB, 128 KB, 
256 KB, 512 KB, and 1024 KB), and (ii) to the server failure 



threshold t (1, 2, and 3, respectively). For better evaluation 
purposes, we vary each variable separately and we fix the 
remaining parameters to a default configuration (Table I). 
We also evaluate the latency incurred in M-PoWerStore with 
respect to the attained throughput. 

We measure peak throughput as follows. We require that 
each writer performs back to back WRITE operations; we 
then increase the number of writers in the system until the 
aggregated throughput attained by all writers is saturated. The 
peak throughput is then computed as the maximum aggregated 
amount of data (in bytes) that can be written/read to the servers 
per second. 

Each data point in our plots is averaged over 5 inde- 
pendent measurements; where appropriate, we include the 
corresponding 95% confidence intervals, as data objects. On 
the other hand, READ operations request the data pertaining 
to randomly-chosen keys. For completeness, we performed our 
evaluation (i) in the Local Area Network (LAN) setting 
comprising our aforementioned network (Section V-B) and (ii) in 
a simulated Wide Area Network (WAN) setting (Section V-C). 
Our evaluation results are presented in Figure 2. 

B. Evaluation Results within a LAN Setting 

Figure 2(a) depicts the latency incurred in M-PoWerStore 
when compared to M-ABD and Phalanx, with respect to the 
achieved throughput (measured in the number of operations 
per second). Our results show that, by combining meta- 
data write-backs with erasure coding, M-PoWerStore achieves 
lower latencies than M-ABD and Phalanx for both READ and 
WRITE operations. As expected, READ latencies incurred 
in PoWerStore are lower than those of WRITE operations 
since a WRITE requires an extra communication round cor- 
responding to the CLOCK round. Furthermore, due to PoW 
and the use of lightweight cryptographic primitives, the READ 
performance of PoWerStore considerably outperforms M-ABD 
and Phalanx. On the other hand, writing in M-PoWerStore 
compares similarly to the writing in M-ABD. 



Figure 2(b) depicts the peak throughput achieved in 
M-PoWerStore with respect to the number of Byzantine servers 
t. As t increases, the gain in peak throughput achieved in 
M-PoWerStore's READ and WRITE increases compared to 
M-ABD and Phalanx. This mainly stems from the reliance 
on erasure coding, which ensures that the overhead of file 
replication among the servers is minimized when compared to 
M-ABD and Phalanx. In typical settings, featuring t = 1 and 
the default parameters of Table I, the peak throughput achieved 
in M-PoWerStore's READ operation is almost twice as large 
as that in M-ABD and Phalanx. 
(d) Latency vs the failure threshold in a simulated WAN setting. 
Fig. 2. Evaluation Results. Data points are averaged over 5 independent runs; where appropriate, we include the corresponding 95% confidence intervals. 



In Figure 2(c) we measure the peak throughput achieved 
in M-PoWerStore with respect to the file size. Our findings 
clearly show that as the file size increases, the performance 
gain of M-PoWerStore compared to M-ABD and Phalanx 
becomes considerable. For example, when the file size is 1 
MB, the peak throughput of READ and WRITE operations 
in M-PoWerStore approaches the (network-limited) bounds of 
50 MB/s and 45 MB/s, respectively. 

C. Evaluation Results within a Simulated WAN Setting 

We now proceed to evaluate the performance of M- 
PoWerStore when deployed in WAN settings. For that purpose, 
we rely on a 100 Mbps switch to bridge the network outlined 
in Section V-A and we rely on NetEm [41] to emulate the 
packet delay variance specific to WANs. More specifically, 
we add a Pareto distribution to our links, with a mean of 20 
ms and a variance of 4 ms. 

We then measure the latency incurred in M-PoWerStore 
in the emulated WAN setting. Our measurement results 
(Figure 2(d)) confirm our previous analysis conducted in the LAN 
scenario and demonstrate the superior performance of 
M-PoWerStore compared to M-ABD and Phalanx in realistic 
settings. Here, we point out that the performance of 
M-PoWerStore incurred in both READ and WRITE operations 
does not deteriorate as the number of Byzantine servers in the 
system increases. This is mainly due to the reliance on erasure 
coding. In fact, the overhead of transmitting an erasure-coded 
file F to the 3t + 1 servers, with a reconstruction threshold of 
t + 1, is given by $\frac{3t+1}{t+1}|F|$. It is easy to see that, 
as t increases, this overhead asymptotically increases towards 3|F|. 

Footnote: Note that an effective peak throughput of 50 MB/s in 
M-PoWerStore reflects an actual throughput of almost 820 Mbps 
when t = 1. 
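
As a small worked example of this overhead, the sketch below evaluates the blow-up factor (3t + 1)/(t + 1) for the failure thresholds used in the evaluation. It only assumes what the text states: S = 3t + 1 fragments, each of size |F|/(t + 1); the function name is hypothetical.

```python
from fractions import Fraction

def storage_overhead(t):
    # Bytes sent per file byte: S = 3t + 1 fragments of size |F| / (t + 1).
    return Fraction(3 * t + 1, t + 1)

for t in (1, 2, 3):
    print(t, float(storage_overhead(t)))
```

The factor is 2 for t = 1, 7/3 for t = 2 and 5/2 for t = 3, and it stays below 3 for every t, whereas full replication to 3t + 1 servers would cost a factor of 3t + 1.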



VI. Related Work 

A seminal crash-tolerant robust linearizable read/write storage 
implementation assuming a majority of correct processes 
was presented in [6]. In the original single-writer variant of 
[6], read operations always take 2 rounds between a client and 
servers, with data write-backs in the second round. On the other 
hand, all write operations complete in a single round; in the 
multi-writer variant [?], the second write round is necessary. 
Server state modifications by readers introduced by [6] are 
unavoidable; namely, [?] showed a t + 1 lower bound on the 
number of servers that a reader has to modify in any wait-free 
linearizable storage. However, robust storage implementations 
differ in the strategy employed by readers: in some protocols 
readers write back data (e.g., [4], [6], [14], [18], [?], [39]) 
whereas in others readers only write metadata to servers (e.g., 
[10], [15], [16]). 

The only two asynchronous storage protocols that feature 
metadata write-backs are the multi-writer crash-tolerant protocols 
of [16] and [10], which are both also linearizable, wait-free and 
optimally resilient. The read/write protocol of [16] fails to 
achieve optimal latency, featuring 3-round writes and reads. 
Vivace [10] is a key-value storage system tailored for WANs 
(geo-replication); it features 3-round reads and 2-round writes, 
saving on a communication round by relying on synchronized 
clocks (NTP, GPS), which are used as counters. In comparison, 
PoWerStore features optimal latency without synchronized 
clocks and is the first protocol to implement metadata 
write-backs while tolerating Byzantine failures. 

Data write-backs are also not needed in case of fast robust 
storage implementations that feature single-round reads and 
writes [15]. Namely, [15] presents fast single-writer 
crash-tolerant and BFT storage implementations in which readers 
only write metadata while reading data in the single round of 
read and hence, without any write-back. However, fast 
implementations are fundamentally limited and cannot be optimally 
resilient, since the number of required servers is inherently 
linear in the number of readers [15]. The limitation on the number 
of readers of [15] was relaxed in [18], where a single-writer 
crash-tolerant robust linearizable storage implementation was 
presented, in which most of the reads complete in a single 
round, yet a fraction of reads is permitted to be "slow" and 
complete in 2 rounds. 

Clearly, most BFT storage implementations have been 
focusing on using as few servers as possible, ideally 3t + 1, which 
defines optimal resilience in the Byzantine model [40]. This is 
achieved by Phalanx [39], a BFT variant of [6]. Phalanx uses 
digital signatures, i.e., self-verifying data, to port [6] to the 
Byzantine model, maintaining the latency of [6], as well as its 
data write-backs. However, since digital signatures introduce 
considerable overhead [37], [42], research attention has shifted 
from protocols that employ self-verifying data [5], [34], [38], 
[39] to protocols that feature lightweight authentication, or no 
data authentication at all (unauthenticated model). 

In the unauthenticated model, [3] ruled out the existence 
of an optimally resilient robust Byzantine fault-tolerant storage 
implementation where all write operations finish after a single 
communication round. This explained the previous difficulties 
in reaching optimal resilience in unauthenticated BFT storage 
implementations where several protocols have used 4t + 1 
servers [7], [19]. Furthermore, [14] showed the impossibility 
of reading from a robust optimally resilient linearizable 
storage in two communication rounds; in addition, if WRITE 
operations perform a constant number of rounds, even reading 
in three rounds is impossible [14]. These results imply that the 
optimal latency of a robust optimally resilient and linearizable 
BFT storage in the unauthenticated model is 2 rounds for 
writes and 4 rounds for reads, even in the single-writer case. 
This can be achieved by the regular-to-linearizable 
transformation of the regular [31] storage protocol of [21]. Hence, it is 
not surprising that other robust BFT storage protocols in the 
unauthenticated model focused on optimizing common-case 
latency with either an unbounded number of read rounds in 
the worst case [20], [22] or a number of read rounds dependent 
on the number of faulty processes t [40]. 

Clearly, there is a big gap between storage protocols that 
use self-verifying data and those that assume no authentication. 
Loft [24] aims at bridging this gap and implements erasure- 
coded optimally resilient linearizable storage while optimizing 
the failure-free case. Loft uses homomorphic fingerprints and 
MACs; it features 3-round wait-free writes, but reads are based 
on data write-backs and are only obstruction-free [26], i.e., 
the number of read rounds is unbounded in case of read/write 
concurrency. Similarly, our Proofs of Writing (PoW) incor- 
porate lightweight authentication that is, however, sufficient 
to achieve optimal latency and to facilitate metadata write- 
backs. We find PoW to be a fundamental improvement in the 
light of BFT storage implementations that explicitly renounce 
linearizability in favor of weaker regularity due to the high 
cost of data write-backs [8]. 

VII. Concluding Remarks 

In this paper, we presented PoWerStore, the first efficient 
robust storage protocol that achieves optimal latency, measured 
in maximum (worst-case) number of communication rounds 
between a client and storage servers, without resorting to 
digital signatures. We also separately presented a multi-writer 
variant of our protocol called M-PoWerStore. The efficiency of 
our proposals stems from combining lightweight cryptography, 
erasure coding and metadata writebacks, where readers write- 
back only metadata to achieve linearizability. While robust 
BFTs have been often criticized for being prohibitively ineffi- 
cient, our findings suggest that efficient and robust BFTs can 
be realized in practice by relying on lightweight cryptographic 
primitives without compromising worst-case performance. 

At the heart of both PoWerStore and M-PoWerStore proto- 
cols are Proofs of Writing (PoW): a novel storage technique 
inspired by commitment schemes in the flavor of [23], that 
enables PoWerStore to write and read in 2 rounds in case 
of the single-writer PoWerStore which we show optimal. 
Similarly, by relying on PoW, M-PoWerStore features 3-round 
writes/reads where the third read round is only invoked under 
active attacks. Finally, we demonstrated M-PoWerStore's su- 
perior performance compared to existing crash and Byzantine- 
tolerant atomic storage implementations. 

We point out that our protocols assume unbounded storage 
capacity to maintain a version history of the various updates 
performed by the writers in the system. We argue, however, 
that this limitation is not particular to our proposals and is 
inherent to all solutions that rely on erasure-coded data [46]. 
Note that prior studies have demonstrated that it takes several 
weeks to exhaust the capacity of versioning systems [47]; in 
case the storage capacity is bounded, the servers can rely on 
efficient garbage collection mechanisms [19] to avoid possible 
exhaustion of the storage system. 
Appendix 
A. Correctness of PoWerStore 

Definition A.1 (Valid candidate). A candidate c is valid if 
valid(c) holds at some correct server. 

Definition A.2 (Timestamps of operations). A READ operation 
rd by a correct reader has timestamp ts iff the reader in 
rd selected c in line 55 such that c.ts = ts. A WRITE operation 
wr has timestamp ts iff the writer increments its timestamp to 
ts in line 4. 

Lemma A.3 (Validity). Let rd be a completed READ by a 
correct reader. If rd returns value V ≠ ⊥, then V was written. 

Proof: We show that if V is the value decoded in line 76, 
then V was indeed written. To show this, we argue that the 
fragments used to decode V were written. Note that prior to 
decoding V from a set of fragments, the reader establishes the 
correctness of each fragment as follows. First, in line 74, the 
reader chooses a cross-checksum that was received from t + 1 
servers. Since one of these servers is correct, the chosen 
cross-checksum was indeed written. Secondly, the reader checks in 
line 75 that each of the t + 1 fragments used to decode V 
hashes to the corresponding entry in the cross-checksum. By 
the collision-resistance of H, all fragments that pass this check 
were indeed written. Therefore, if V is the value decoded from 
these fragments, we conclude that V was written. □ 

Lemma A.4 (Proofs of Writing). If c is a valid candidate, 
then there exists a set Q of t + 1 correct servers such that 
each server s_i ∈ Q changed Hist[c.ts] to (fr_i, cc, H(c.N)). 

Proof: If c is valid, then by Definition A.1, valid(c) is true 
at some correct server s_j. Hence, H(c.N) = Hist[c.ts].N̄ 
holds at s_j. By the pre-image resistance of H, no 
computationally bounded adversary can acquire c.N from the sole 
knowledge of H(c.N). Hence, c.N stems from the writer in 
a WRITE operation wr with timestamp c.ts. By Algorithm 1, 
line 8, the value of c.N is revealed after the STORE phase in 
wr completed. Hence, there exists a set Q of t + 1 correct 
servers such that each server s_i ∈ Q changed Hist[c.ts] to 
(fr_i, cc, H(c.N)). □ 

Lemma A.5 (No exclusion). Let c be a valid candidate and 
let rd be a READ by a correct reader that includes c in C 
during COLLECT. Then c is never excluded from C. 

Proof: As c is valid, by Lemma A.4, there exists a set Q 
of t + 1 correct servers such that each server s_i ∈ Q changed 
Hist[c.ts] to (fr_i, cc, H(c.N)). Hence, valid(c) is true at every 
server in Q. Thus, no server in Q replies with a timestamp 
ts < c.ts in line 41. Therefore, at most S − t − 1 = 2t 
timestamps received by the reader in the FILTER phase are 
lower than c.ts, and so c is never excluded from C. □ 



Lemma A.6 (READ/WRITE Atomicity). Let rd be a completed 
READ by a correct reader. If rd follows some complete 
WRITE(V), then rd does not return a value older than V. 

Proof: If ts is the timestamp of WRITE(V), it is sufficient to 
show that the timestamp of rd is not lower than ts. To prove 
this, we show that ∃c' ∈ C such that (i) c'.ts ≥ ts and (ii) c' 
is never excluded from C. 

By the time WRITE(V) completes, t + 1 correct servers hold 
in lc a candidate whose timestamp is ts or greater. According 
to lines 26 and 31 of Algorithm 2, a correct server never changes 
lc to a candidate with a lower timestamp. Hence, when rd is 
invoked, t + 1 correct servers hold candidates with timestamp 
ts or greater in lc. Hence, during the COLLECT phase in rd, 
some candidate received from a correct server with timestamp 
ts or greater is inserted in C. Such a candidate is necessarily 
valid because either the server received it directly from the 
writer, or the server checked its integrity in line 30. Let c' 
be the valid candidate with the highest timestamp in C. Then 
by Lemma A.5, c' is never excluded from C. By line 55, no 
candidate c such that c.ts < c'.ts is selected. Since c'.ts ≥ ts, 
no candidate with a timestamp lower than ts is selected in rd. □ 

Lemma A.7 (READ atomicity). Let rd and rd' be two 
completed READ operations by correct readers. If rd' follows 
rd that returns V, then rd' does not return a value older than 
V. 

Proof: If c is the candidate selected in rd, it is sufficient 
to show that the timestamp of rd' is not lower than c.ts. We 
argue that C contains a candidate c' such that (i) c'.ts ≥ c.ts 
and (ii) c' is never excluded from C. 

By the time rd completes, t + 1 correct servers hold c in 
LC. As c was selected in rd in line 55, some correct server 
asserted that c is valid in line 30. According to Algorithm 2, 
if a correct server excludes c from LC in line 32, then the 
server changed lc to a valid candidate with timestamp c.ts or 
greater in line 31. Consequently, t + 1 correct servers hold in 
LC ∪ {lc} a valid candidate with timestamp c.ts or greater. As 
such, during COLLECT in rd', a valid candidate c' such that 
c'.ts ≥ c.ts is included in C, and by Lemma A.5, c' is never 
excluded from C. By line 55, no candidate with a timestamp 
lower than c'.ts is selected. Since c'.ts ≥ c.ts, no candidate 
with a timestamp lower than c.ts is selected in rd'. □ 

Theorem A.8 (Atomicity). Algorithms 1, 2 and 3 are atomic. 

Proof: This proof follows directly from Lemmas A.3, A.6 
and A.7. □ 

We now proceed to proving wait-freedom. 

Theorem A.9 (Wait-freedom). Algorithms 1, 2 and 3 are 
wait-free. 
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Proof: We show that no operation invoked by a correct 
client ever blocks. The wait-freedom argument of the WRITE is 
straightforward; in every phase, the writer awaits acks from the 
least number S~t of correct servers. The same argument holds 
for the COLLECT phase of the READ. Hence, in the remainder 
of the proof, we show that no READ blocks in the FILTER 
phase. By contradiction, consider a READ rd by reader r that 
blocks during the FILTER phase after receiving FILTER_ACK 
messages from all correct servers. We distinguish two cases: 
(Case 1) C includes a valid candidate and (Case 2) C includes 
no valid candidate. 

• Case 1: Let c be the highest valid candidate included in 
C. We show that highcand(c) A safe(c) holds. Since c 
is valid, by Lemma |A.4| there exists a set Q of t + 1 
correct servers such that each server Si G Q changed 
Hist[c.ts] to {fri,cc, H{c.N)). Thus, during the FILTER 
phase, valid(c) holds at every server in Q. As no valid 
candidate in C has a higher timestamp than c, (i) all 
servers Si E Q (at least t+1) responded with timestamp 
c.ts, corresponding erasure coded fragment /r^, cross- 
checksum cc in line [41] and (ii) all correct servers (at 
least S — t) responded with timestamps at most c.ts. By 
(i), c is safe. By (ii), every c' E C such that c' .ts > c.ts 
became invalid and was excluded from C, implying that 
c is Iniglncand. 

• Case 2: Here, we show that C = 0. As none of the 
candidates in C is valid, during the FILTER phase, the 
integrity check in line [30] failed for every candidate in 
C at all correct servers. Hence, at least S — t servers 
responded with timestamp ts^. Since tsQ is lower than 
any candidate timestamp, all candidates were classified 
as invalid and were excluded from C. 

□ 
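The two cases above can be replayed on concrete numbers. The following Python sketch is illustrative only. It assumes S = 3t + 1 servers, models timestamps as plain integers with ts_0 = 0, and reconstructs the predicates from the way the proof uses them: a candidate becomes invalid once S − t servers report a lower timestamp, is safe once t + 1 servers report exactly its timestamp, and is highcand when it has the highest timestamp left in C. All function and variable names are hypothetical and do not correspond to lines of Algorithm 3.

```python
def invalid(c_ts, acks, S, t):
    """Assumed rule: c is dropped from C once S - t servers report less than c.ts."""
    return sum(1 for ts in acks.values() if ts < c_ts) >= S - t


def safe(c_ts, acks, t):
    """Assumed rule: c is safe once t + 1 servers report exactly c.ts."""
    return sum(1 for ts in acks.values() if ts == c_ts) >= t + 1


def filter_phase(candidates, acks, S, t):
    """Return ("select", ts), ("empty", None) or ("wait", None).

    candidates: set of candidate timestamps gathered in COLLECT (the set C).
    acks: dict mapping a server index to the timestamp in its FILTER_ACK.
    """
    remaining = {ts for ts in candidates if not invalid(ts, acks, S, t)}
    if not remaining:
        return ("empty", None)      # Case 2: C becomes empty
    high = max(remaining)           # highcand: highest timestamp left in C
    if safe(high, acks, t):
        return ("select", high)     # Case 1: highcand(c) and safe(c) hold
    return ("wait", None)           # more FILTER_ACK messages are needed


S, t = 4, 1
# Case 1: candidate 3 is valid (stored at the t + 1 correct servers 0 and 1),
# candidate 5 is forged; server 3 is Byzantine and vouches for the forgery.
case1 = filter_phase({3, 5}, {0: 3, 1: 3, 2: 0, 3: 5}, S, t)
# Case 2: no candidate is valid, so every correct server answers with ts_0 = 0.
case2 = filter_phase({5, 7}, {0: 0, 1: 0, 2: 0, 3: 7}, S, t)
print(case1, case2)
```

In both runs the reader reaches a decision using only the replies of the S − t correct servers, which is the situation the proof shows cannot block.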



Theorem A.10 (Latency). Algorithms 1, 2 and 3 feature a
latency of two communication rounds for the WRITE and two
for the READ.

Proof: By Algorithm 1, the WRITE completes after two
phases, STORE and COMPLETE, each taking one communication
round. By Algorithm 3, the READ completes after two
phases, COLLECT and FILTER, each incurring one communication
round. □

B. Correctness of M-PoWerStore

Definition A.11 (Valid candidate). A candidate c is valid iff
valid(c) is true at some correct server.

Definition A.12 (Timestamps of operations). A READ operation
rd by a non-malicious reader has timestamp ts iff
the reader in rd selected c in line 145 such that c.ts = ts.
A WRITE operation wr has timestamp ts iff the CLOCK
procedure in wr returned ts in line 87.

Lemma A.13 (Validity). Let rd be a completed READ by a
correct reader. If rd returns value V ≠ ⊥, then V was written.

Proof: We show that if V is the value decoded in line 167,
then V was indeed written. To show this, we argue that the
fragments used to decode V were written. Note that prior to
decoding V from a set of fragments, the reader establishes the
correctness of each fragment as follows. First, in line 165, the
reader chooses a cross-checksum that was received from t + 1
servers. Since one of these servers is correct, the chosen
cross-checksum was indeed written. Secondly, the reader checks in
line 166 that each of the t + 1 fragments used to decode V
hashes to the corresponding entry in the cross-checksum. By
the collision-resistance of H, all fragments that pass this check
were indeed written. Therefore, if V is the value decoded from
these fragments, we conclude that V was written. □

Lemma A.14 (WRITE atomicity). Let op be a completed
operation by a correct client and let wr be a completed WRITE
such that op precedes wr. If ts_op and ts_wr are the timestamps
of op and wr respectively, then ts_wr > ts_op.

Proof: By the time op completes, t + 1 correct servers hold in
lc a candidate whose timestamp is ts_op or greater. According
to lines 117, 124, 129 of Algorithm 5, a correct server never
updates lc with a candidate that has a lower timestamp.
Hence, the writer in wr obtains from the CLOCK procedure a
timestamp that is greater or equal to ts_op from some correct
server s_i. Let c be the candidate held in lc by server s_i, and let
c.ts be the timestamp reported to the writer. We now argue that
c.ts is not fabricated. To see why, note that prior to overwriting
lc with c in line 124 (resp. 129), server s_i checks that c is
valid in line 123 (resp. 129). The valid predicate as defined
in line 132 subsumes an integrity check for c.ts. Hence, c.ts
passes the integrity check in line 102. According to the WRITE
algorithm, ts_wr ≥ (c.ts.num + 1, *, *) > c.ts ≥ ts_op. □
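The fragment check behind the validity argument of Lemma A.13 can be sketched as follows. This is a hedged illustration, not the code of Algorithm 6: it assumes SHA-256 as the hash function H, represents a cross-checksum as a tuple holding one hash per server fragment, and uses hypothetical function names. Erasure decoding itself is omitted; the sketch only selects the fragments that would be handed to the decoder.

```python
import hashlib
from collections import Counter


def H(data: bytes) -> bytes:
    # Assumed instantiation of the collision-resistant hash H.
    return hashlib.sha256(data).digest()


def cross_checksum(fragments):
    """One hash per fragment, indexed by server."""
    return tuple(H(fr) for fr in fragments)


def verified_fragments(responses, t):
    """responses: dict server index -> (fragment, cross-checksum).

    Returns the fragments that hash to their entry in a cross-checksum
    reported by at least t + 1 servers, or None if fewer than t + 1 qualify.
    """
    votes = Counter(cc for _, cc in responses.values())
    if not votes:
        return None
    cc, count = votes.most_common(1)[0]
    if count < t + 1:
        return None  # no cross-checksum is vouched for by a correct server
    good = {i: fr for i, (fr, _) in responses.items()
            if i < len(cc) and H(fr) == cc[i]}
    return good if len(good) >= t + 1 else None


t = 1
fragments = [b"frag-0", b"frag-1", b"frag-2", b"frag-3"]
cc = cross_checksum(fragments)
responses = {
    0: (fragments[0], cc),
    1: (fragments[1], cc),
    2: (b"corrupted", cc),          # right cross-checksum, wrong fragment
    3: (b"bogus", (b"x",) * 4),     # Byzantine: wrong fragment and checksum
}
good = verified_fragments(responses, t)
print(sorted(good))
```

Only fragments 0 and 1 survive the check, so a corrupted or fabricated fragment never reaches the decoder.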



Lemma A.15 (Proofs of Writing). If c is a valid candidate,
then there exists a set Q of t + 1 correct servers such that each
server s_i ∈ Q changed Hist[c.ts] to (fr_i, cc, H(c.N), vec).

Proof: If c is valid, then by Definition A.11, valid(c) is
true at some correct server s_j. Hence, either H(c.N) =
Hist[c.ts].NH or verify(c.vec[j], c.ts, H(c.N), k_j) must hold
at s_j. By the pre-image resistance of H, no computationally
bounded adversary can acquire c.N from the sole knowledge
of H(c.N). Hence, c.N stems from some writer in a WRITE
operation wr with timestamp c.ts. By Algorithm 4, line 92,
the value of c.N is revealed after the STORE round in wr
completed. Hence, there exists a set Q of t + 1 correct
servers such that each server s_i ∈ Q changed Hist[c.ts] to
(fr_i, cc, H(c.N), vec). □
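The two ways in which valid(c) can hold at a correct server can be illustrated with a small Python sketch. It is only a model under stated assumptions: H is instantiated with SHA-256, verify is modeled as an HMAC-SHA256 comparison under a key k_j that is assumed to be shared between the writers and server s_j, timestamps are plain integers, and the class and method names are hypothetical rather than taken from Algorithms 4 and 5.

```python
import hashlib
import hmac


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def mac(key: bytes, ts: int, nonce_hash: bytes) -> bytes:
    # Assumed instantiation of the tag that verify(...) checks.
    return hmac.new(key, str(ts).encode() + nonce_hash, hashlib.sha256).digest()


class Server:
    def __init__(self, key: bytes):
        self.key = key   # k_j (assumption: shared with the writers)
        self.hist = {}   # Hist: ts -> (fragment, cc, NH, vec)

    def store(self, ts, fragment, cc, nonce_hash, vec):
        # STORE round: the server learns H(N) but not the nonce N itself.
        self.hist[ts] = (fragment, cc, nonce_hash, vec)

    def valid(self, ts, nonce, vec, j):
        nh = H(nonce)
        entry = self.hist.get(ts)
        by_hist = entry is not None and hmac.compare_digest(nh, entry[2])
        by_mac = hmac.compare_digest(vec[j], mac(self.key, ts, nh))
        return by_hist or by_mac


# Fixed values keep the demo reproducible; a real nonce would be random.
keys = [bytes([i]) * 16 for i in range(4)]
servers = [Server(k) for k in keys]
ts, nonce = 7, b"n" * 32
nh = H(nonce)
vec = [mac(k, ts, nh) for k in keys]

for s in servers[:3]:                 # STORE reaches S - t = 3 servers
    s.store(ts, b"fr", b"cc", nh, vec)

# After COMPLETE the nonce is public and the candidate validates.
ok_hist = servers[0].valid(ts, nonce, vec, 0)
ok_mac = servers[3].valid(ts, nonce, vec, 3)     # missed STORE, MAC still checks
forged = servers[0].valid(ts, b"f" * 32, [b"\x00" * 32] * 4, 0)
print(ok_hist, ok_mac, forged)
```

A candidate built from a guessed nonce fails both checks, which mirrors the pre-image resistance step of the proof: a nonce that validates must have been revealed by a writer after its STORE round completed.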

Lemma A.16 (No exclusion). Let c be a valid candidate and 
let rd be a READ by a correct reader that includes c in C 
during COLLECT. Then c is never excluded from C. 



Proof: As c is valid, by Lemma A.15 there exists a set Q
of t + 1 correct servers such that each server s_i ∈ Q changed
Hist[c.ts] to (*, *, H(c.N), vec). Hence, validByHist(c) is
true at every server in Q. Thus, no server in Q replies
with a timestamp ts < c.ts in line 127. Therefore, at most
S − t − 1 = 2t timestamps received by the reader in the
FILTER round are lower than c.ts, and so c is never excluded
from C. □
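The counting step can be spelled out as a short calculation. It assumes S = 3t + 1 servers, which is what the identity S − t − 1 = 2t implies, and it assumes, as the wait-freedom proofs suggest, that a candidate is excluded only once S − t servers report a lower timestamp.

```latex
\begin{align*}
\#\{\text{replies with } ts < c.ts\}
  &\le S - |Q| = (3t + 1) - (t + 1) = 2t \\
  &< 2t + 1 = S - t .
\end{align*}
```

Under these assumptions the exclusion threshold S − t is never reached, so c stays in C.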

Lemma A.17 (READ/WRITE Atomicity). Let rd be a completed
READ by a correct reader. If rd follows some complete
WRITE(V), then rd does not return a value older than V.

Proof: If ts is the timestamp of WRITE(V), it is sufficient to
show that the timestamp of rd is not lower than ts. To prove
this, we show that ∃c' ∈ C such that (i) c'.ts ≥ ts and (ii) c'
is never excluded from C.

By the time WRITE(V) completes, t + 1 correct servers hold
in lc a candidate whose timestamp is ts or greater. According
to lines 117, 124, 129 of Algorithm 5, a correct server never
changes lc to a candidate with a lower timestamp. Hence,
when rd is invoked, t + 1 correct servers hold candidates with
timestamp ts or greater in lc. Hence, during COLLECT in rd,
some candidate received from a correct server with timestamp
ts or greater is inserted in C. Such a candidate is necessarily
valid by the integrity checks in lines 123, 129. Let c' be the
valid candidate with the highest timestamp in C. Then by
Lemma A.16, c' is never excluded from C. By line 145, no
candidate c such that c.ts < c'.ts is selected. Since c'.ts ≥ ts,
no candidate with a timestamp lower than ts is selected in rd. □

Lemma A.18 (READ atomicity). Let rd and rd' be two
completed READ operations by correct readers. If rd' follows
rd that returns V, then rd' does not return a value older than
V.

Proof: If c is the candidate selected in rd, it is sufficient
to show that the timestamp of rd' is not lower than c.ts. We
argue that C contains a candidate c' such that (i) c'.ts ≥ c.ts
and (ii) c' is never excluded from C.

As c is selected in rd in line 145 only if safe(c) holds,
some correct server verified the integrity of c.ts and c.N in
line 125. In addition, in REPAIR, the reader in rd checks the
integrity of c.vec. We distinguish two cases:

• Case 1: If c.vec passes the integrity check in line 171,
then the integrity of c has been fully established. Hence,
by the time rd completes, t + 1 correct servers validated
c in line 123 and changed lc to c or to a higher valid
candidate.

• Case 2: If vector c.vec fails the integrity check in
line 171, then in REPAIR, c is repaired in line 172 and
subsequently written back to t + 1 correct servers. Hence,
by the time rd completes, t + 1 correct servers validated
c in line 129 and changed lc to c or to a higher valid
candidate.

Consequently, in the COLLECT round in rd', a valid candidate
c' such that c'.ts ≥ c.ts is included in C, and by Lemma A.16,
c' is never excluded from C. By line 145, no candidate with
a timestamp lower than c'.ts is selected. Since c'.ts ≥ c.ts, no
candidate with a timestamp lower than c.ts is selected in rd'. □

Theorem A.19 (Atomicity). Algorithms 4, 5 and 6 are atomic.

Proof: This follows directly from Lemmas A.13, A.14, A.17
and A.18. □

We now proceed to proving wait-freedom.

Theorem A.20 (Wait-freedom). Algorithms 4, 5 and 6 are
wait-free.

Proof: We show that no operation invoked by a correct
client ever blocks. The wait-freedom argument of the WRITE is
straightforward; in every round, the writer awaits acks from the
least number S − t of correct servers. The same argument holds
for the COLLECT and the REPAIR rounds of the READ. Hence,
in the remainder of the proof, we show that no READ blocks
in the FILTER round. By contradiction, consider a READ rd by
reader r that blocks during the FILTER round after receiving
FILTER_ACK messages from all correct servers. We distinguish
two cases: (Case 1) C includes a valid candidate and (Case
2) C includes no valid candidate.

• Case 1: Let c be the highest valid candidate included in
C. We show that highcand(c) ∧ safe(c) holds. Since c is
valid, by Lemma A.15 there exists a set Q of t + 1 correct
servers such that each server s_i ∈ Q changed Hist[c.ts]
to (fr_i, cc, H(c.N), vec). Thus, during the FILTER round,
validByHist(c) holds at every server in Q. As no valid
candidate in C has a higher timestamp than c, (i) all
servers s_i ∈ Q (at least t + 1) responded with timestamp
c.ts, corresponding erasure coded fragment fr_i,
cross-checksum cc and repair vector vec in line 127, and (ii) all
correct servers (at least S − t) responded with timestamps
at most c.ts. By (i), c is safe. By (ii), every c' ∈ C such
that c'.ts > c.ts became invalid and was excluded from
C, implying that c is highcand.

• Case 2: Here, we show that C = ∅. As none of the
candidates in C is valid, during FILTER, the integrity
check in line 125 failed for every candidate in C at all
correct servers. Hence, at least S − t servers responded
with timestamp ts_0. Since ts_0 is lower than any candidate
timestamp, all candidates were classified as invalid and
were excluded from C.

□



Theorem A.21 (Non-skipping Timestamps). Algorithms 4, 5
and 6 implement non-skipping timestamps.

Proof: By construction, a fabricated timestamp would fail
the check in line 102. Hence, no fabricated timestamp is ever
used in a WRITE. The theorem then directly follows from the
algorithm of WRITE. □
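One way to picture this argument is the sketch below. It is a simplified model rather than the WRITE algorithm itself: timestamps are Python tuples compared lexicographically with num as the most significant component (consistent with the comparison used in the proof of Lemma A.14), the remaining two components are placeholders, and is_valid is a hypothetical stand-in for the integrity check in line 102.

```python
def next_timestamp(clock_replies, is_valid, writer_id, tag):
    """Pick the timestamp for a new WRITE from the CLOCK replies.

    Replies that fail the integrity check are ignored, so a Byzantine server
    cannot push the counter forward with a fabricated timestamp.
    """
    valid_nums = [ts[0] for ts in clock_replies if is_valid(ts)]
    return (max(valid_nums, default=0) + 1, writer_id, tag)


# Timestamps that stem from real WRITE operations (illustrative values).
written = {(2, "w2", "b"), (3, "w1", "a")}
# One reply carries a fabricated, very large timestamp.
replies = [(3, "w1", "a"), (10**9, "evil", "x"), (2, "w2", "b")]

ts_wr = next_timestamp(replies, lambda ts: ts in written, "w3", "c")
print(ts_wr)  # (4, 'w3', 'c')
```

The new timestamp exceeds the highest genuine one by exactly one in its num component, and the fabricated value has no effect on it.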



Theorem A.22 (Latency). Algorithms 4, 5 and 6 feature a
latency of three communication rounds for the WRITE and two
for the READ in the absence of attacks. In the worst case, the
READ latency is three communication rounds.

Proof: By Algorithm 4, the WRITE completes after three
rounds, CLOCK, STORE and COMPLETE, each taking one
communication round. In the absence of attacks, by Algorithm 6,
the READ completes after two rounds, COLLECT and FILTER,
each taking one communication round. Under BigMac p3| 
attacks the READ may go to the REPAIR round, incurring one 
additional communication round. □ 